RapidMiner Studio documentation, version 10.2
Process Documents from Data (Text Processing)
Synopsis
Generates word vectors from string attributes.
Input
- word list: The word list port.
- example set (Data Table): The example set port.
Output
- example set (Data Table): The example set port.
- word list: The word list port.
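
To make the synopsis concrete, the following is a minimal Python sketch of what generating word vectors from string values can look like. It is not the operator's implementation: the whitespace tokenizer, the function names, and the use of plain term-occurrence counts are all assumptions made for illustration, and the operator's actual schema is chosen with vector_creation.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer (assumption): lowercase, then split on whitespace.
    return text.lower().split()


def build_word_vectors(texts):
    """Return (word_list, vectors): one term-occurrence vector per text."""
    token_lists = [tokenize(t) for t in texts]
    # The word list is the sorted set of all tokens seen in any text.
    word_list = sorted({tok for toks in token_lists for tok in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        # One entry per word in the word list; missing words count as 0.
        vectors.append([counts[w] for w in word_list])
    return word_list, vectors


texts = ["the cat sat", "the dog sat down", "the cat ran"]
word_list, vectors = build_word_vectors(texts)
```

In this sketch, `word_list` plays the role of the word list output and `vectors` holds one numeric row per input text, analogous to the rows of the example set output.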
Parameters
- create_word_vector: If checked, the tokens of a document are used to generate a vector that numerically represents the document.
- vector_creation: Selects the schema for creating the word vector.
- add_meta_information: If checked, available meta information about the text, such as filename and date, is added as attributes.
- keep_text: If checked, the input text is stored as a special String attribute with the role text.
- prune_method: Specifies whether too frequent or too infrequent words should be ignored when building the word list, and how the frequencies are specified.
- prune_below_percent: Ignore words that appear in less than this percentage of all documents.
- prune_above_percent: Ignore words that appear in more than this percentage of all documents.
- prune_below_absolute: Ignore words that appear in fewer than this many documents.
- prune_above_absolute: Ignore words that appear in more than this many documents.
- prune_below_rank: Words are ordered by frequency, and words with a frequency lower than the frequency at the rank given by this percentage are pruned.
- prune_above_rank: Words are ordered by frequency, and words with a frequency higher than the frequency at the rank given by this percentage are pruned.
- datamanagement: Determines how the data is represented internally.
- select_attributes_and_weights: If checked, you can select the text attributes to use and their weights. Otherwise, all text attributes are used.
- specify_weights: Sets a weight per attribute. Text from attributes with a higher weight is more important during analysis.
- parallelize_vector_creation: Determines whether vector creation is executed in parallel.
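
The percentual and absolute pruning parameters described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the operator's code: the function names and the sample documents are hypothetical, documents are assumed to be already tokenized, and both bounds are assumed to be inclusive (a word exactly at a threshold is kept).

```python
from collections import Counter


def document_frequencies(token_lists):
    # Count in how many documents each word appears (not total occurrences).
    df = Counter()
    for toks in token_lists:
        df.update(set(toks))
    return df


def prune_by_percent(token_lists, below_percent, above_percent):
    """Keep words whose document share lies within [below, above] percent."""
    n_docs = len(token_lists)
    df = document_frequencies(token_lists)
    return sorted(
        word for word, count in df.items()
        if below_percent <= 100.0 * count / n_docs <= above_percent
    )


def prune_by_absolute(token_lists, below_absolute, above_absolute):
    """Keep words whose document count lies within [below, above]."""
    df = document_frequencies(token_lists)
    return sorted(
        word for word, count in df.items()
        if below_absolute <= count <= above_absolute
    )


# Hypothetical, already tokenized sample documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "down"],
    ["the", "cat", "ran"],
    ["the", "bird"],
]
kept_percent = prune_by_percent(docs, below_percent=30.0, above_percent=90.0)
kept_absolute = prune_by_absolute(docs, below_absolute=2, above_absolute=3)
```

Here "the" appears in every document and is dropped as too frequent, while words that appear in only one document are dropped as too infrequent. Rank-based pruning is not covered by this sketch.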
